The Logic of Regressions
Recall that the regression coefficient is calculated from the
covariance of \(X\) and \(Y\). If the values of the observations in a
sample differ a lot, the variance is high; if the values are relatively
similar, the variance is low.
For instance, take the numbers 2, 9 and 67. We calculate the
variance by summing the squared differences between each number and the
mean and dividing by the number of observations (minus one if the numbers
are a sample rather than a population):
\[
\sigma^2=\frac{\sum_{i=1}^n\left(x_i-\bar{x}\right)^2}{n-1}
\]
numbers <- c(2,9,67)
#to create a vector containing the numbers 2, 9 and 67
numbers
[1] 2 9 67
num_mean <- mean(numbers)
#to compute the mean value for all the observations in the vector numbers
num_mean
[1] 26
num_var_sample <- 1/(3-1) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the sample variance
num_var_sample
[1] 1273
num_var_pop <- 1/(3) * ((2-num_mean)^2 + (9-num_mean)^2 + (67-num_mean)^2)
#to compute the population variance
num_var_pop
[1] 848.6667
# More simply:
var(numbers) # note var() automatically makes a small-sample adjustment
[1] 1273
The variance is high (849 if the numbers represent the whole
population, 1273 if they are a sample).
Recall that the variance represents the average squared distance
between observations and the mean. The standard deviation, accordingly,
is the square root of this average squared distance: roughly 36.
sd(numbers)
[1] 35.67913
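We can confirm this relationship directly in R, using the same three numbers as above:

```r
# the same three numbers as above
numbers <- c(2, 9, 67)

# the standard deviation is the square root of the sample variance
sqrt(var(numbers))
sd(numbers)  # same value
```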
To summarise the relationship between two variables, we can calculate
the correlation. The correlation is a standardised,
unit-free measure of the relationship between two
variables: it represents the degree to which one variable is associated
with another.
cor(brexit_data$age_mean, brexit_data$Percent_Leave, use = "complete")
[1] 0.3825466
#the correlation is 0.38 on a scale from -1 to 1.
Since the correlation is positive, we know that as age increases so
too does the percentage of Leave voters in the regional
unit.
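As a sketch of what "standardised" means here: the correlation is just the covariance rescaled by the standard deviations of the two variables, which is why it always lies between -1 and 1. A toy illustration with made-up vectors (not from brexit_data):

```r
# two short made-up vectors (illustrative only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 6)

# the correlation is the covariance rescaled by both standard deviations
cov(x, y) / (sd(x) * sd(y))
cor(x, y)  # same value
```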
But how much does a one-year increase in age add to the
Leave vote in general terms? Let’s use regression
to find out!
In linear regression analysis, we get the coefficient \(\beta\) by dividing the covariance of \(X\) and \(Y\) (the extent to which \(X\) and \(Y\) move together) by the variance of \(X\) (how much variation there is in \(X\), simply speaking). In other words:
how much do \(X\) and \(Y\) move together, relative to how much
\(X\) varies on its own?
\[ \hat{\beta}=\frac{\operatorname{COV}(X,
Y)}{\operatorname{VAR}(X)} \]
To calculate the coefficient manually, we start by calculating the
covariance of age and Leave voting:
age_leave_cov <- cov(brexit_data$Percent_Leave, brexit_data$age_mean, use = "complete")
#note the extra "use" argument at the end:
#missing observations should be ignored, otherwise the command returns NA.
age_leave_cov
[1] 10.96389
Next we can calculate the variance in our \(X\)-variable, i.e. age:
age_var <- var(brexit_data$age_mean, use = "complete")
age_var
[1] 8.515135
Finally, we can divide the covariance of \(X\) and \(Y\) by the variance of \(X\) to estimate the regression coefficient
\(\beta\):
beta <- age_leave_cov/age_var
beta
[1] 1.287577
We see that, based on our linear estimation, a one-year increase in
the mean age of the electoral district is associated with
an increase in the area's Leave vote share of
1.288 percentage points on average.
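We can convince ourselves that this manual formula really is what R's regression routine computes by trying both on a small made-up data set (the vectors x and y below are illustrative, not from brexit_data):

```r
# made-up vectors (illustrative only, not from brexit_data)
x <- c(30, 35, 40, 45, 50)
y <- c(40, 44, 55, 58, 66)

# the manual formula: covariance of X and Y over variance of X
beta_manual <- cov(x, y) / var(x)

# the slope estimated by lm() for the same data
beta_lm <- unname(coef(lm(y ~ x))[2])

beta_manual
beta_lm  # the two coincide
```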
Linear Regression in R
Doing this calculation manually, however, is time-consuming.
Fortunately, R can easily calculate the regression
coefficients for us (including the intercept value, too, which is
automatically added to the model):
mod1 <- lm(brexit_data$Percent_Leave ~ brexit_data$age_mean)
#the results are stored as an object in the working environment. Printing mod1 directly shows only the coefficients, so it is not the most informative way to interpret the regression - feel free to try it and compare with the output below.
#Usually, we print the full output using the summary command:
summary(mod1)
Call:
lm(formula = brexit_data$Percent_Leave ~ brexit_data$age_mean)
Residuals:
Min 1Q Median 3Q Max
-31.2068 -5.6726 0.5681 6.1777 22.2880
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6113 6.7841 0.385 0.701
brexit_data$age_mean 1.2876 0.1682 7.657 1.97e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.088 on 342 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.1463, Adjusted R-squared: 0.1438
F-statistic: 58.63 on 1 and 342 DF, p-value: 1.97e-13
You can see that the coefficient value for age_mean is
the same as the manual value (\(\beta\)) we computed above.
Next, we might wonder how much of the variation in \(Y\) is explained by the variation in \(X\). We can find out by asking for the
R-squared, which broadly indicates what portion of the
variation is being explained by the regression model:
summary(mod1)$r.squared
[1] 0.1463419
We get an R-squared of 0.1463, which means that about
15 percent of the variation in the Leave vote
is explained by age. R-squared tells us how much better the model
predicts the outcome than simply using the mean (in this case the
mean Leave vote) as the prediction for every value of
\(Y\). Here, we conclude that age does account for a meaningful
part of the observed variation: constituencies with a higher mean
age are more likely to vote Leave. So, knowing the
age of the voters gives us a better prediction of how a constituency
votes than just relying on the mean Leave vote.
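A useful side note: in a bivariate regression, the R-squared is simply the square of the correlation coefficient (0.3825466^2 is about 0.1463, matching the output above). A quick sketch with made-up data:

```r
# made-up vectors (illustrative only, not from brexit_data)
x <- c(1, 3, 5, 7, 9)
y <- c(2, 3, 7, 6, 10)

# squaring the correlation...
r2_from_cor <- cor(x, y)^2

# ...gives the R-squared of the bivariate regression
r2_from_lm <- summary(lm(y ~ x))$r.squared

r2_from_cor
r2_from_lm  # identical in the bivariate case
```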
Moreover, once we fit our estimated linear regression line, we
can predict the Leave vote share even for mean ages which are
not included in our data.
plot(brexit_data$Percent_Leave ~ brexit_data$age_mean, xlim =c(30,50),
ylim =c(3,80), ylab = "Leave vote share", xlab = "Mean age", pch=16, col="blue")
abline(mod1)

# Recall why we specify the arguments in the command above: We want to create a figure that is as insightful, clear and parsimonious as possible!
#xlim = what are the limits of the x-axis? Mean ages are between 30 and 50.
#we can find this out by
min(brexit_data$age_mean, na.rm=TRUE)
[1] 30.9
#alternative code:min(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
max(brexit_data$age_mean, na.rm=TRUE)
[1] 47.7
#alternative code:max(brexit_data$age_mean[!is.na(brexit_data$age_mean)])
#ylim = what are the limits of the y-axis?
min(brexit_data$Percent_Leave)
[1] 4.085381
max(brexit_data$Percent_Leave) #no missing values here
[1] 75.56243
#Leave vote share between 4 and 76
# xlab and ylab for naming the y and x axes
# pch and col just to make the dots prettier!
#if you only want a simple graph as a first look, just run
plot(brexit_data$Percent_Leave ~ brexit_data$age_mean)
abline(mod1)

At the constituency mean age of 40.23808, we
predict a 54 percent vote share of Leave. You can
also calculate this manually without the graph by using the equation
\[y=\alpha+\beta x\] Based on our simple
bivariate model, our estimated value of the Leave vote will
be
\[2.611 + (1.288 \times age_{mean}) = 2.611 +
(1.288 \times 40.23808) = 54.44, \] which is very close to what we got by
looking at the graph.
Manual calculation:
#estimated % Leave share = 2.611 + 1.288 * mean age
mean(brexit_data$age_mean,na.rm = TRUE)
[1] 40.23808
beta # as we calculated before
[1] 1.287577
2.611+1.288*40.23808
[1] 54.43765
# Recall that you can call the estimated intercept and regression slope from the OLS model we saved earlier
mod1$coefficients[1]+mod1$coefficients[2]*mean(brexit_data$age_mean,na.rm = TRUE)
(Intercept)
54.42089
#estimated share at another value of age:
mod1$coefficients[1]+mod1$coefficients[2]*45
(Intercept)
60.55223
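Rather than multiplying coefficients by hand, R's built-in predict() function does the same arithmetic from a fitted model. A sketch on made-up data: note that because mod1 was fit with the data$column syntax, predict() with new values works more cleanly when the model is fit with a data argument instead, as below.

```r
# a toy data set mimicking the age/Leave relationship (made-up values)
toy <- data.frame(age = c(32, 36, 40, 44, 48),
                  leave = c(40, 46, 54, 57, 63))

# fitting with a data argument (rather than the data$column syntax
# used for mod1) lets predict() accept new values cleanly
toy_mod <- lm(leave ~ age, data = toy)

# predicted Leave share at a mean age of 45
predict(toy_mod, newdata = data.frame(age = 45))

# the same number by hand from the coefficients
coef(toy_mod)[1] + coef(toy_mod)[2] * 45
```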
We now turn to a different question: Did immigration levels
correlate with the Leave vote - and, if so, can we predict
the Leave vote share based on the level of immigration in a
given constituency?
Let’s try to answer this by regressing the Leave vote
share on the percentage of UK-born residents in each constituency.
UKborn <- lm(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent)
summary(UKborn) #to print out the result
Call:
lm(formula = brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent)
Residuals:
Min 1Q Median 3Q Max
-20.4253 -4.7247 -0.0025 4.4336 23.4417
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.72722 3.63142 0.751 0.453
brexit_data$Birth_UK_percent 0.58210 0.04062 14.331 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.775 on 342 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.3752, Adjusted R-squared: 0.3734
F-statistic: 205.4 on 1 and 342 DF, p-value: < 2.2e-16
We get a coefficient of 0.58210 for
Birth_UK_percent. This means that a one percentage-point increase in
the proportion of UK-born residents is associated with an average rise
of 0.58 percentage points in the Leave vote share. The more UK-born
residents live in the electoral district, the higher the
Leave vote!
Let’s also look at the correlation between these two variables:
cor(brexit_data$Percent_Leave, brexit_data$Birth_UK_percent, use="complete")
[1] 0.6125353
How do you interpret this correlation? Is
the share of UK-born residents a better predictor than age? Let’s plot this and include our line of
best fit.
plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, pch=16, col="blue")
abline(UKborn)

Now, let’s try to look at this same relationship by creating a
variable for the share of foreign-born residents in each constituency,
before regressing the Leave vote on this newly created
variable. [Can you think of another way to create this
variable?]
brexit_data$foreignborn_percentage <- brexit_data$Birth_other_EU_percent +
brexit_data$Birth_Africa_percent +
brexit_data$Birth_MidEast_Asia_percent +
brexit_data$Birth_Americas_Carrib_percent +
brexit_data$Birth_Antarctica_Oceania_Other_percent
#foreignborn_percentage is the new variable
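One possible answer to the bracketed question above, sketched on a toy data frame (the column names mimic brexit_data but the values are made up, and only two of the non-UK birth-place columns are included; whether the 100-minus shortcut is exact depends on the birth-place categories being exhaustive):

```r
# a toy data frame mimicking brexit_data's structure (made-up values)
toy <- data.frame(Birth_UK_percent       = c(90, 70, 85),
                  Birth_other_EU_percent = c(5, 15, 8),
                  Birth_Africa_percent   = c(5, 15, 7))

# alternative 1: rowSums() over the non-UK columns instead of typing out a sum
toy$foreign1 <- rowSums(toy[, c("Birth_other_EU_percent", "Birth_Africa_percent")])

# alternative 2: if the categories are exhaustive and sum to 100,
# the foreign-born share is simply 100 minus the UK-born share
toy$foreign2 <- 100 - toy$Birth_UK_percent

toy
```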
cor(brexit_data$foreignborn_percentage,
brexit_data$Percent_Leave, use="complete") #this time negative
[1] -0.5973909
foreigners <- lm(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage)
summary(foreigners)
Call:
lm(formula = brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage)
Residuals:
Min 1Q Median 3Q Max
-20.6996 -4.6138 0.0424 4.5006 23.9212
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 60.6691 0.6218 97.58 <2e-16 ***
brexit_data$foreignborn_percentage -0.6199 0.0450 -13.78 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.888 on 342 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.3569, Adjusted R-squared: 0.355
F-statistic: 189.8 on 1 and 342 DF, p-value: < 2.2e-16
#foreigners stores the results of this regression; to view them, just type "foreigners"
plot(brexit_data$Percent_Leave ~ brexit_data$foreignborn_percentage, pch=16, col="blue")
abline(foreigners)

A concern we may have at this point is whether our results are
possibly driven by an outlier. We can get a better idea about
what is influencing our estimates by asking R to label
our observations.
plot(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, type="n")
text(brexit_data$Percent_Leave ~ brexit_data$Birth_UK_percent, labels=brexit_data$Region)

The labels are hard to read, but we can see that areas on the lower end
of both the Leave vote share and the share of
UK-born residents are located in London. Let’s subset the
data to exclude London, and see if our estimates change.
#how many districts are in the London region?
length(brexit_data$Region[brexit_data$Region=="London"])
[1] 33
#we subset the data to exclude the 33 areas in London
withoutLondon <- subset(brexit_data, Region != "London")
dim(withoutLondon)
[1] 349 62
#The new number of observations is 349.
#This comes from subtracting London's 33 areas from the original total of 382.
#and now we re-run the regression
immigWL <- lm(withoutLondon$Percent_Leave ~ withoutLondon$Birth_UK_percent)
summary(immigWL) #what happened to our coefficient?
Call:
lm(formula = withoutLondon$Percent_Leave ~ withoutLondon$Birth_UK_percent)
Residuals:
Min 1Q Median 3Q Max
-20.9939 -4.4872 -0.0867 4.4734 22.9108
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.16900 6.94355 1.321 0.188
withoutLondon$Birth_UK_percent 0.51244 0.07576 6.764 6.74e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.375 on 309 degrees of freedom
(38 observations deleted due to missingness)
Multiple R-squared: 0.129, Adjusted R-squared: 0.1261
F-statistic: 45.75 on 1 and 309 DF, p-value: 6.738e-11
What do you make of this estimate? How does it relate to your
previous model? Provide a substantive interpretation.
Regression models often provide the most important insights in data
analysis - it’s not by chance that they have become a workhorse of
quantitative social science. Let’s present our regression result in a nice
way - we can use the stargazer package to create a
decent-looking regression table.
#install.packages("stargazer") #note the quotation marks
library(stargazer)
fit_example <- lm(c(1,2,3,4,5,6) ~ c(4,9,8,1,2,3))
stargazer(fit_example, type="text")
===============================================
Dependent variable:
---------------------------
c(1, 2, 3, 4, 5, 6)
-----------------------------------------------
c(4, 9, 8, 1, 2, 3) -0.308
(0.241)
Constant 4.888**
(1.301)
-----------------------------------------------
Observations 6
R2 0.291
Adjusted R2 0.113
Residual Std. Error 1.761 (df = 4)
F Statistic 1.640 (df = 1; 4)
===============================================
Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(mod1,immigWL, type="text")
===================================================================
Dependent variable:
-----------------------------------------------
Percent_Leave Percent_Leave
(1) (2)
-------------------------------------------------------------------
age_mean 1.288***
(0.168)
Birth_UK_percent 0.512***
(0.076)
Constant 2.611 9.169
(6.784) (6.944)
-------------------------------------------------------------------
Observations 344 311
R2 0.146 0.129
Adjusted R2 0.144 0.126
Residual Std. Error 9.088 (df = 342) 7.375 (df = 309)
F Statistic 58.629*** (df = 1; 342) 45.747*** (df = 1; 309)
===================================================================
Note: *p<0.1; **p<0.05; ***p<0.01